System and method for determining originality of data content

ABSTRACT

The present invention provides systems and methods for determining the originality of data content. In one embodiment, the determined originality of a particular item (e.g., a book) as compared to one or more other items can be used as a factor in recommending the item to a user. For example, in one embodiment, upon a user's selection of an item (e.g., a book), one or more items that have content most diverse from the selected item are determined and provided to the user. In another embodiment, various versions of an item are compared to each other to determine how content in each version differs from that in another version. In another embodiment, content in a collection of items is compared against content from publicly (freely) available sources (e.g., web pages) to determine the originality of the content in the collection of items.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of and claims benefit of priority to U.S. patent application Ser. No. 12/886,343, filed Sep. 20, 2010, entitled “SYSTEM AND METHOD FOR DETERMINING ORIGINALITY OF DATA CONTENT,” which is hereby incorporated herein by reference in its entirety.

BACKGROUND

Electronic commerce is an increasingly popular way of selling products and services, referred to herein collectively and interchangeably as “items,” to consumers. To assist consumers in selecting products and services, many electronic commerce vendors provide recommendations to consumers. For example, consumers who are shopping for books may be provided recommendations for books that are similar or complementary to the books that they are browsing. Most such recommendations are based on prior consumer activities, such as user behavior data, e.g., records of items purchased together or common items viewed by the same consumers, and focus on recommending substitute or complementary items.

While some of these recommendations may assist consumers in making purchase decisions or discovering new items, these recommendations are not based on the content of the items. For certain types of items such as books, it may be desirable for consumers to receive recommendations that are based at least in part on the content of the items, rather than solely on user behavior data.

DESCRIPTION OF THE DRAWINGS

The foregoing aspects and many of the attendant advantages of this disclosure will become more readily appreciated as the same become better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein:

FIG. 1 is a block diagram depicting an illustrative operating environment including a retail server and an original content determination server for determining the originality of content for a plurality of items in accordance with one embodiment;

FIG. 2 is a block diagram illustrating a computer device that can perform various methods of determining original content in accordance with one embodiment;

FIG. 3A is a flow diagram depicting an illustrative method of determining originality scores within a collection of items in accordance with one embodiment;

FIG. 3B shows an example matrix of originality scores in accordance with one embodiment;

FIG. 4A is a flow diagram depicting an illustrative method of using originality scores in recommending diverse items in accordance with one embodiment;

FIG. 4B is an illustrative user interface in which diverse items are recommended in accordance with one embodiment;

FIG. 4C is an illustrative user interface showing the use of originality scores in generating recommendations in accordance with one embodiment;

FIG. 4D is a flow diagram depicting an illustrative method of using originality scores in recommending diverse items in accordance with another embodiment;

FIG. 5A is a flow diagram depicting an illustrative method of determining the originality of the content of a version of an item relative to the content of another version in accordance with one embodiment;

FIG. 5B is an illustrative user interface showing how the method of FIG. 5A aids a user in selecting an item for purchase;

FIG. 6A is a flow diagram depicting an illustrative method of determining the originality of an item relative to publicly available content in accordance with one embodiment; and

FIG. 6B is an illustrative user interface showing how the method of FIG. 6A aids a user in selecting an item for purchase.

DETAILED DESCRIPTION

Most recommendation systems focus on recommending substitute items or complementary items and are often based on what items the users of the systems have viewed, purchased or rated. In certain situations, e.g., in selecting a book for purchase, a recommendation based on the originality or diversity of an item's content may be desirable. For example, it may be desirable to find books that have original content, as many books are often compilations of content from previously published books. If the originality or diversity of the content of the book is used to make the recommendation, different books of more interest to the user may be recommended. If a recommendation is based on what items have been viewed, purchased, or rated, the recommended items may simply include updated versions of the book, in which most of the content has been duplicated and is consequently of less interest to the user. In contrast, if the recommendation is based on the originality of the content of the book, a recommendation of another book with more diverse content, and thus, of more interest to the user, may be made.

Unfortunately, the originality or diversity of content typically cannot be deduced from user behavior data. While a recommendation system could solicit user feedback on originality or diversity of content, requiring explicit user feedback is undesirable in that many users do not take the time to provide such feedback. Embodiments disclosed herein provide systems and methods for determining originality of content that do not rely on user feedback, and using the determined originality in making recommendations to the user. In one embodiment, the determined originality of the content of a particular item (e.g., a book) as compared to one or more other items can be used as a factor in recommending an item to a user. For example, upon a user's selection of a book, one or more similar or complementary books (based on behavior data) that have content most diverse from the selected book are determined and provided to the user. In another embodiment, various versions of an item are compared to each other to determine how content in each version differs from that in another version. In another embodiment, content in a collection of items is compared against content from publicly (freely) available sources (e.g., web pages) to determine the originality of the content in the collection of items. These embodiments are further described below in conjunction with the associated figures. Although numerous examples provided herein relate to books, it is understood that the systems and methods provided herein are applicable to other types of items for which semi-structured data content is available (e.g., magazines, web pages, audio content, video content, etc.).

Original Content Determination System

The illustrative operating environment shown in FIG. 1 includes a system 100 in which users may view, rate, or purchase one or more items. The system 100 may include an original content determination server 120 that includes an original content determination module 125 for determining originality/diversity of items and/or using the determined originality/diversity to provide item recommendations to users of the system 100. The environment also includes a retail server 110 that facilitates electronic browsing and purchasing of goods and services using various user devices, such as computing device 102. Those skilled in the art will recognize that the computing device 102 may be any of a number of computing devices that are capable of communicating over a network including, but not limited to, a laptop, personal computer, personal digital assistant (PDA), hybrid PDA/mobile phone, mobile phone, electronic book reader, digital media player, and the like. The original content determination server 120, which will be described below in more detail, may be connected to or in communication with an item data store 118 that stores information associated with items available for purchase through the retail server 110, and/or items for which information is available to users, but which are not currently available to be ordered. Item data stored in the item data store 118 may include any information related to an item that may be of interest to a user or may be useful for classifying the item. For example, item data may include, but is not limited to, price, availability, title, item identifier, item feedback (e.g., user reviews, ratings, etc.), item image, item description, item attributes, tags associated with the item, etc.

In different embodiments, the item data store 118 may be local to the original content determination server 120, may be local to the retail server 110, may be remote from both the original content determination server 120 and retail server 110, and/or may be a network-based service itself. In the environment shown in FIG. 1, a user of the system 100 may utilize the computing device 102 to communicate with the retail server 110 via a communication network 108, such as the Internet or other communications link. The network 108 may be any wired network, wireless network or combination thereof. In addition, the network 108 may be a personal area network, local area network, wide area network, cable network, satellite network, cellular telephone network, etc. or combination thereof. Protocols and components for communicating via the Internet or any of the other aforementioned types of communication networks are well known to those skilled in the art of computer communications and, thus, need not be described in more detail herein.

The system 100 is depicted in FIG. 1 as operating in a distributed computer environment comprising several computer systems that are interconnected using one or more computer networks. The system 100 could also operate within a computer system having a fewer or greater number of components than are illustrated in FIG. 1. Thus, the depiction of system 100 in FIG. 1 should be taken as illustrative and not limiting to the present disclosure. For example, the system 100 could implement various Web services components and peer-to-peer network configurations to implement at least a portion of the processes.

In brief, the retail server 110 is generally responsible for providing front-end communication with various user devices, such as computing device 102, via network 108. The front-end communication provided by the retail server 110 may include generating text and/or graphics, possibly organized as a user interface using hypertext transfer or other protocols in response to information inquiries received from the various user devices. The retail server 110 may obtain information on available goods and services (referred to herein as “items”) from one or more data stores (not illustrated), as is done in conventional electronic commerce systems. In certain embodiments, the retail server 110 may also access item data from other data sources, either internal or external to system 100.

FIG. 2 depicts a general architecture of the original content determination server 120 for determining the originality of content and generating recommendations based on the determined originality. While the original content determination system is depicted in FIG. 2 as implemented by a single computing device (i.e., an original content determination server 120), this is illustrative only. In an actual embodiment, the original content determination system may be embodied in a plurality of computing systems, each executing an instance of the original content determination server. In another embodiment, the original content determination system may be implemented in distributed and/or networked computing systems.

The general architecture of the original content determination server 120 depicted in FIG. 2 includes an arrangement of computer hardware and software components that may be used to implement aspects of the present disclosure. The original content determination server 120 may include many more (or fewer) components than those shown in FIG. 2. It is not necessary, however, that all of these generally conventional components be shown in order to provide an enabling disclosure.

As illustrated in FIG. 2, the original content determination server 120 includes a network interface 206, a processing unit 204, an input/output device interface 220, an optional display 202, an optional input device 222 and a computer readable medium drive 207, all of which may communicate with one another by way of a communication bus. The network interface 206 may provide connectivity to one or more networks or computing systems. The processing unit 204 may thus receive information and instructions from other computing systems or services via a network. The processing unit 204 may also communicate to and from memory 210 and further provide output information for an optional display 202 via the input/output device interface 220. Though not shown, the input/output device interface may also accept input from the optional input device 222 such as a keyboard, mouse, digital pen, trackpad, touchscreen, etc.

The memory 210 contains computer program instructions that the processing unit 204 executes in order to operate the original content determination server. The memory 210 generally includes RAM, ROM and/or other persistent, non-transient memory. The memory 210 may store an operating system 214 that provides computer program instructions for use by the processing unit 204 in the general administration and operation of the original content determination server 120. The memory 210 may further include computer program instructions and other information for implementing features of the original content determination system, including the processes described above. For example, in one embodiment, the memory 210 includes a user interface module 212 that generates user interfaces (and/or instructions therefor) for display upon a client computing device, e.g., via a navigation interface such as a web browser installed on the client computing device. In addition, memory 210 may include or communicate with the item data store 118. Item data stored in item data store 118 may include any information related to an item, such as an item available for purchase, that may be of interest to a consumer or may be useful for classifying the item or determining similarity to other items. Item attributes stored for a given item may depend on the type of item or a category associated with the item. For example, as described above, certain items such as books may have semi-structured data (e.g., textual data) stored in the item data store 118 that can be readily processed by the original content determination server for original content determination. In one embodiment, the item data store 118 also stores data related to the scores and matrices determined/generated by the processes described herein.

In addition to the user interface module 212, the memory 210 may include the original content determination module 125 that may be executed by the processing unit 204. In one embodiment, the original content determination module 125 provides the functionality of the various methods/processes described herein, e.g., determining the originality of the content of an item in a collection of items relative to the content of other items in the collection.

Determining Originality Scores

In some embodiments, the original content determination server 120 provides to users recommendations based on an evaluation of the content of the items referenced in the item data store 118. In one embodiment, the evaluation of the content generates originality scores, which are then used as a basis to recommend items with original or diverse content to users. In some embodiments, an originality score indicates a degree to which content of one item in a set of similar items is diverse from the content of another item in the set. FIG. 3A is a flow diagram depicting an illustrative method 300 for determining originality scores within a collection of items in accordance with one embodiment. This illustrative method, as well as those described below, may be performed by a computer device such as that depicted in FIGS. 1 and/or 2. In the embodiment of FIG. 3A, the method 300 begins in block 302, where content for one or more items available for selection is processed for evaluation and later used in recommendations. This method could, for example, include extracting textual data from a collection of items (e.g., books) in a catalog (e.g., a product catalog, library catalog, etc.). In other examples where content is in a non-textual form, the content may be converted to textual data.

In block 304, an item I_(i) in the collection, e.g., a product catalog, is processed to determine a set of items in the collection that are similar to I_(i). In block 306, originality scores are generated for pairs of items, with each pair consisting of (1) an item in the set of items that is deemed similar to I_(i) and (2) I_(i). In one embodiment, the process repeats the operations in blocks 304 and 306 for each item in a collection of items, so that a matrix of scores can be generated in block 308. However, for the sake of simplicity, the description of blocks 304 and 306 below relates to the processing of one item in the collection.

Returning to the discussion of block 304, in one embodiment, the method 300 compares each of the other items in the collection I₁ to I_(N) to an item I_(i) to determine items that are similar to item I_(i). The similarity between two items can be determined in a number of ways. For example, the similarity between two items may be based on view or purchase similarities, which measure the likelihood that the two items are viewed in the same session (“view similarities”) or purchased together (“purchase similarities”). In one embodiment, when view or purchase similarities do not exist or are otherwise not available for the two items under evaluation, the method 300 deems the two items to be similar if they share the same topic/classification. For example, if the two items under evaluation are books, they would be deemed similar if they are both computer programming books (i.e., they cover the same topic). In one embodiment, a shared topic/classification can be identified by determining whether both items are assigned to the same browse node (e.g., a node within a tree of nodes organized by topics/classifications), are assigned the same tags, or have the same textual classification, i.e., a topic identified from the textual content of the item.
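The selection of similar items in block 304 can be sketched as follows. This is a minimal illustration only: the function names, the dictionary-based data layout, and the `min_sim` cutoff are assumptions introduced here and are not specified in the description above.

```python
from typing import Dict, FrozenSet, List

def are_similar(a: str, b: str,
                behavior_sims: Dict[FrozenSet[str], float],
                topics: Dict[str, str],
                min_sim: float = 0.1) -> bool:
    """Decide whether two items are similar.

    View/purchase similarity is used when it is available for the pair;
    otherwise the items are deemed similar if they share a topic.
    min_sim is a hypothetical cutoff, not taken from the description.
    """
    pair = frozenset((a, b))
    if pair in behavior_sims:
        return behavior_sims[pair] >= min_sim
    # Fallback: same topic/classification (e.g., same browse node or tags).
    return a in topics and b in topics and topics[a] == topics[b]

def similar_set(item: str, collection: List[str],
                behavior_sims: Dict[FrozenSet[str], float],
                topics: Dict[str, str],
                min_sim: float = 0.1) -> List[str]:
    """Return every other item in the collection deemed similar to item."""
    return [other for other in collection
            if other != item
            and are_similar(item, other, behavior_sims, topics, min_sim)]

behavior = {frozenset(("I1", "I2")): 0.4, frozenset(("I1", "I3")): 0.01}
topics = {"I1": "programming", "I3": "programming",
          "I4": "programming", "I5": "cooking"}
print(similar_set("I1", ["I1", "I2", "I3", "I4", "I5"], behavior, topics))
```

In this sketch the topic fallback applies only when no behavior data exists for the pair, which follows the wording above; a real system might combine both signals differently.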

In one embodiment, once a set of similar items is determined for I_(i), in block 306, the method 300 generates originality scores for pairs of items within the set, with each pair consisting of (1) an item in the set of items similar to I_(i), and (2) I_(i). For example, if the set of similar items consists of I₂, I₃, I₄ and I₅, the pairs would be (I_(i), I₂), (I_(i), I₃), (I_(i), I₄) and (I_(i), I₅). In one embodiment, an originality score of a pair of items is determined by min-wise independent permutations between the two items under evaluation. The min-wise independent permutations of two items may be computed by first representing the content of the first item (C_(i)) as one group of words and the content of the second item (C_(j)) as another group of words. For example, if an item is a book, the group of words would consist of all words in the book (excluding duplicates and without reference to order or grammar). In one embodiment, natural numbers are then assigned to words in each of the two groups of words, with the same word receiving the same number assignment. In one embodiment, a group of words is a bag of words. For example, the words in the phrase “the same word receives the same number assignment” may be assigned the numbers shown below:

Word          Number
the           1
same          2
word          3
receives      4
the           1
same          2
number        5
assignment    6

In one embodiment, instead of treating the words individually, n-grams are used, so that each group of n consecutive words is treated as a phrase and assigned a natural number. The table below shows the number assignments for 2-grams of the same example phrase:

2-gram               Number
the same             1
word receives        2
the same             1
number assignment    3

In one embodiment, statistically insignificant words such as “the,” “a,” “or,” “not,” “of,” etc., are removed from the groups of words prior to the number assignments.
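The number assignment described above can be sketched as follows. The helper names are hypothetical, and grouping the words into non-overlapping chunks of n is an assumption chosen to match the 2-gram example; other n-gram schemes (such as overlapping windows) would produce different numbers.

```python
from typing import Dict, List

# Illustrative stop-word list taken from the examples in the text.
STOP_WORDS = frozenset({"the", "a", "or", "not", "of"})

def remove_stop_words(words: List[str]) -> List[str]:
    """Drop statistically insignificant words before numbering."""
    return [w for w in words if w not in STOP_WORDS]

def to_ngrams(words: List[str], n: int) -> List[str]:
    """Group words into consecutive, non-overlapping chunks of n words."""
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]

def assign_numbers(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map each token to a natural number; equal tokens share a number.

    The same vocab must be reused for both items under comparison so that
    a given word receives the same number in each group of words.
    """
    numbers = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
        numbers.append(vocab[token])
    return numbers

phrase = "the same word receives the same number assignment".split()
print(assign_numbers(phrase, {}))                     # [1, 2, 3, 4, 1, 2, 5, 6]
print(assign_numbers(to_ngrams(phrase, 2), {}))       # [1, 2, 1, 3]
print(assign_numbers(remove_stop_words(phrase), {}))  # [1, 2, 3, 1, 4, 5]
```

Converting the returned list to a `set` gives the set of natural numbers that represents the content of an item.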

Once natural numbers are assigned to the two groups of words, C_(i) and C_(j) can each be represented as a set of natural numbers, i.e., C_(i) ⊂ N and C_(j) ⊂ N. Thus, for two pieces of content, C_(i) and C_(j), the similarity may be measured as the size of their set intersection divided by the size of their set union:

$\mu = \frac{|C_{i} \cap C_{j}|}{|C_{i} \cup C_{j}|}$

where μ is the originality score. Accordingly, the smaller the originality score, the more original the content of one item is compared to the other item. In some embodiments, other techniques such as term-document frequency, inverse document frequency or edit distance may be used instead of min-wise independent permutations to calculate originality scores for pairs of items. In another embodiment, Latent Semantic Analysis (LSA) can be applied to analyze the relationships between a set of items and the words they contain by producing a set of concepts related to the items and words. Applying LSA helps reduce the dimensionality of the data and reduce computationally the number of comparisons that need to be performed to derive the similarity results, since comparison can be performed on the set of concepts instead of individual words. LSA also helps mitigate the effects of synonymy (different words describing the same idea) and polysemy (same word having multiple meanings) on the comparison process. In one embodiment, the scores (regardless of how calculated) are normalized on a scale from 0 to 1 as part of the operation performed in block 306, with 0 denoting the most diverse and 1 denoting the least diverse.
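As a minimal sketch of the score above, the following computes μ exactly and also estimates it with min-wise hashing. The function names, the linear hash family `(a*x + b) mod p`, and the choice of 64 hash functions are assumptions for illustration; the description does not specify how the permutations are realized.

```python
import random
from typing import List, Set, Tuple

_PRIME = (1 << 61) - 1  # Mersenne prime; token numbers are assumed to be smaller

def jaccard(c_i: Set[int], c_j: Set[int]) -> float:
    """Exact score: |C_i ∩ C_j| / |C_i ∪ C_j|."""
    union = c_i | c_j
    if not union:
        return 1.0  # assumption: two empty contents are treated as identical
    return len(c_i & c_j) / len(union)

def make_hashes(k: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw k (a, b) pairs defining hash functions x -> (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(k)]

def signature(content: Set[int], hashes: List[Tuple[int, int]]) -> List[int]:
    """Min-hash signature: the minimum hash value under each function.

    Requires a non-empty content set.
    """
    return [min((a * x + b) % _PRIME for x in content) for a, b in hashes]

def estimate(sig_i: List[int], sig_j: List[int]) -> float:
    """Fraction of signature positions that agree; approximates the score."""
    matches = sum(1 for u, v in zip(sig_i, sig_j) if u == v)
    return matches / len(sig_i)

hashes = make_hashes(64)
c_i, c_j = {1, 2, 3}, {2, 3, 4}
print(jaccard(c_i, c_j))  # 0.5
print(estimate(signature(c_i, hashes), signature(c_j, hashes)))
```

The estimate is useful when items are large, because fixed-length signatures can be compared instead of full word sets; its accuracy improves as the number of hash functions grows.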

Returning to FIG. 3A, after the originality scores for all pairs of items have been calculated, the calculated (and in one embodiment, normalized) originality scores are stored in a diversity matrix in block 308. An example of such a diversity matrix is shown in FIG. 3B. The diversity matrix of scores is then used in various original/diverse content recommendation determinations in block 310, as further described below.
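One way to picture block 308 is a sparse matrix keyed by item pairs. The sketch below is an assumption about the data layout (a dictionary keyed by `(item, other)` tuples) and uses the exact intersection-over-union score; the layout of the matrix in FIG. 3B is not reproduced here.

```python
from typing import Dict, List, Set, Tuple

def originality_score(c_i: Set[int], c_j: Set[int]) -> float:
    """Intersection size over union size of two number sets."""
    union = c_i | c_j
    return len(c_i & c_j) / len(union) if union else 1.0

def build_diversity_matrix(
        contents: Dict[str, Set[int]],
        similar: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Score each (item, similar item) pair and store it by pair key."""
    matrix: Dict[Tuple[str, str], float] = {}
    for item, neighbours in similar.items():
        for other in neighbours:
            matrix[(item, other)] = originality_score(contents[item],
                                                      contents[other])
    return matrix

contents = {"I1": {1, 2, 3, 4}, "I2": {3, 4, 5, 6}, "I3": {1, 2, 3, 4}}
similar = {"I1": ["I2", "I3"]}
matrix = build_diversity_matrix(contents, similar)
print(matrix)
```

Only pairs already deemed similar are scored, so the matrix stays small relative to the full N-by-N comparison.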

Determining Diverse Items

One application of the diversity matrix of scores is providing recommendations of items that are diverse from an item selected by the user. Because the diversity matrix of scores includes originality scores of pairs of items that have been determined to be similar (e.g., have view/purchase similarities or share a same topic), the scores enhance the recommendations that are provided to the user. For example, if a user has selected a book A within a product catalog that includes many books that are similar to book A, the originality scores can be used to recommend one or more books from among those books that are already deemed similar to book A. The recommended books would thus be relevant to book A because they share view/purchase similarities or the same topic, but yet would be diverse from book A in content, a characteristic that is particularly helpful to users who are seeking to discover items with original content. Take for example an avid reader of mystery novels shopping for a new book who may have selected a book written by a familiar author. Recommendations based on view or purchase similarities may include other books written by the same author or other mystery novels, many of which the avid reader may have already purchased or read. The diversity matrix of scores in this example can be used to provide the avid reader recommendations of mystery novels with content that is diverse from his/her selected book. This increases the probability that the recommended books are new to the avid reader.

FIG. 4A is a flow diagram depicting an illustrative method 400 for using originality scores stored in a diversity matrix (such as that shown in FIG. 3B) in recommending diverse items in accordance with one embodiment. The method 400 begins in block 402, where the method 400 receives an indication of an item of interest I_(i) within a collection. The method in block 404 then looks up, in the diversity matrix, the originality score for each pair of items in the set, with each pair consisting of (1) an item in the set of items similar to I_(i), and (2) I_(i). In one embodiment, the diversity matrix is one that has been previously generated by the method 300 shown in FIG. 3A. For example, if I_(i) is I₁ and the set of items that are similar to I₁ consists of I₂, I₃, I₄ and I₅, the pairs would be (I₁, I₂), (I₁, I₃), (I₁, I₄), and (I₁, I₅), and scores from those pairs are obtained from the matrix shown in FIG. 3B. In block 406, the method 400 stores the obtained score for the current pair in a list. In decision block 408, the method 400 determines if there are more pairs of items to process. If so, the method 400 returns to block 404 and looks up the originality score for the next pair of items. In the above example, since there are four pairs to process, four originality scores are obtained from the diversity matrix. The list of scores may be as follows (with reference to the scores in the matrix shown in FIG. 3B):

(I₁, I₂): 0.5, (I₁, I₃): 0.2, (I₁, I₄): 0.9, (I₁, I₅): 0.8

Once the scores are obtained and stored in the list, the method 400 may sort the list according to a criterion in block 410. For example and as shown in FIG. 4A, the list is sorted from the least diverse, e.g., highest score, to the most diverse, e.g., lowest score (or vice versa). Thus, using the example above, after sorting, the list may appear as follows:

(I₁, I₄): 0.9, (I₁, I₅): 0.8, (I₁, I₂): 0.5, (I₁, I₃): 0.2

In one embodiment, the method 400 may optionally filter the sorted list in accordance with a threshold. The filter may be applied to the list to exclude items that are likely not diverse enough to be useful as recommendations. For example, if a threshold of equal to or less than 0.5 is applied, the list would be filtered to:

(I₁, I₂): 0.5, (I₁, I₃): 0.2
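The look-up, sort, and filter steps can be sketched with the example scores given above. The function name and the dictionary-based matrix are assumptions for illustration.

```python
from typing import Dict, List, Optional, Tuple

def diverse_candidates(item: str,
                       similar_items: List[str],
                       matrix: Dict[Tuple[str, str], float],
                       threshold: Optional[float] = None
                       ) -> List[Tuple[str, float]]:
    """Look up, sort, and optionally filter originality scores for one item."""
    # Look up the score of each (item, similar item) pair and collect it.
    scored = [(other, matrix[(item, other)]) for other in similar_items]
    # Sort from least diverse (highest score) to most diverse (lowest).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Optionally keep only pairs whose score is at or below the threshold.
    if threshold is not None:
        scored = [pair for pair in scored if pair[1] <= threshold]
    return scored

matrix = {("I1", "I2"): 0.5, ("I1", "I3"): 0.2,
          ("I1", "I4"): 0.9, ("I1", "I5"): 0.8}
items = ["I2", "I3", "I4", "I5"]
print(diverse_candidates("I1", items, matrix))
# [('I4', 0.9), ('I5', 0.8), ('I2', 0.5), ('I3', 0.2)]
print(diverse_candidates("I1", items, matrix, threshold=0.5))
# [('I2', 0.5), ('I3', 0.2)]
```

Raising the threshold leaves more items on the list, which corresponds to offering the user additional recommendations.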

In one embodiment, the method 400 uses the list in block 414 to generate recommendations of items with diverse content. For example, the items may be computer programming books within a product catalog and I_(i) may be a particular book selected by the user. FIG. 4B shows an illustrative user interface 420 showing how such recommendations are provided to the user. In one embodiment, upon the selection of an item by the user, the method 400 may be performed so as to generate a list of items (in this case, computer programming books) sorted from the most diverse to the least diverse (as compared to the user selected item, which is I_(i) in this example). FIG. 4B shows that the user has selected a book 426 for possible purchase. The list (or a portion thereof) generated by the method 400 may be returned to the user to assist the user in selecting one or more books that have the most diverse content as compared to the book selected by the user. Using the example above, two books, I₂ and I₃ on the filtered list, may be provided to the user as recommendations. FIG. 4B shows the two books 422 and 424 that are recommended to the user. In one embodiment, as shown in FIG. 4B, the user interface optionally includes a user interface control element 428 through which the user can indicate a desire to be provided with additional recommendations. In one embodiment, once the user indicates this through the user interface control element 428, the filter threshold described above is adjusted so that more items remain on the list, which leads to more items being recommended to the user.

In other embodiments, the above described list/scores are used as one of many factors in the selection of recommended items. FIG. 4C is an illustrative user interface 430 that shows how, in one embodiment, the list/scores described above may be used as a factor in a recommendation process that takes into account other considerations such as prior user activity or other user behavior. As shown in the user interface 430, in response to the user's selection of a book 432, a number of books 434, 436 and 438 are recommended to the user based on purchase similarities (i.e., these books are all known to be frequently purchased with the book 432). The user interface 430 provides two user interface control elements 440 and 442 through which these recommendations may be sorted or re-sorted. For example, if the user selects user interface control element 440, the books 434, 436, and 438 are sorted by their originality scores (from most diverse to least diverse) relative to the book 432 selected by the user. Similarly, if the user selects user interface control element 442, books 434, 436, and 438 are sorted by their originality scores (from least diverse to most diverse) relative to the book 432 selected by the user. In other embodiments, the list of recommended books presented to the user is sorted by default from most diverse to least diverse.

FIG. 4D is a flow diagram illustrating a method 450 for generating originality scores and recommending diverse items to a user. The method 450 is similar to the method 400, except that filtering of the list is performed in decision block 456, where an additional determination is performed to determine whether an originality score of a particular pair of items satisfies a threshold. The score is stored in the list in block 458 only if the score meets the threshold. For example, if a threshold of 0.5 is used, only scores that are equal to or less than 0.5 are stored. After all the pairs of items have been processed (as determined by decision block 460), the list is sorted and stored in block 462 and can be used in generating recommendations as described above (in block 464).

Determining Originality of Content in Different Versions of the Same Item

FIG. 5A is a flow diagram depicting an illustrative method 500 for determining the originality of the content of a version of an item relative to the content of another version of the item. The method 500 begins in block 502 by determining various versions or editions of an item for an originality determination. For example, if an item I_(j) is under a user's consideration for possible purchase, the method 500 identifies the other various versions or editions of I_(j) (if they exist). Then in block 504, the method 500 determines the originality scores of pairs of item versions/editions, with each pair in one embodiment consisting of (1) one of the other versions/editions identified in block 502 for I_(j), and (2) the item I_(j) itself. For example, if there are three other versions/editions of I_(j), namely I_(j1), I_(j2) and I_(j3), the pairs would be (I_(j), I_(j1)), (I_(j), I_(j2)) and (I_(j), I_(j3)). In one embodiment, the scores for the pairs are generated using the methods described above (e.g., based on min-wise independent permutations). In this context, an originality score indicates a degree to which content of the one version/edition of the item is diverse from content of another version/edition of the same item. After the scores are generated, in one embodiment, they are stored in a matrix in block 506. In block 508, the matrix of scores is used in providing the user a diversity measure between versions/editions. In one embodiment, a diversity measure is a percentage reflecting the content difference between two items. In one embodiment, the diversity measure is calculated as follows: diversity measure = (1 − originality score of the pair) × 100%. Thus, if the originality score is 0.5, the diversity measure is 50%.
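The diversity measure formula can be illustrated with a short sketch. The function name and the range check are assumptions added for illustration.

```python
def diversity_measure(originality_score: float) -> float:
    """Return (1 - originality score) * 100, i.e., a percentage difference."""
    if not 0.0 <= originality_score <= 1.0:
        raise ValueError("originality score is expected to lie in [0, 1]")
    return (1.0 - originality_score) * 100.0

print(diversity_measure(0.5))  # 50.0
```

A score of 1 (identical content) yields 0%, and a score of 0 (no shared content) yields 100%.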

FIG. 5B shows an illustrative user interface 510 illustrating how the diversity measure aids a user in selecting an item for purchase. In the example, item 512 is a textbook that the user has selected for potential purchase (I_(j) above). In one embodiment, the method 500 may be executed to provide the user the diversity measures of the user selected edition of textbook I_(j) as compared to other editions of I_(j), namely I_(j1), I_(j2) and I_(j3). As shown, the user selected edition of the textbook (512) is the latest edition, and the user is shown a number of messages 514 with diversity measures that compare the selected edition with prior editions. In addition to providing the user a quantitative measure of the differences, in one embodiment as shown in FIG. 5B, for each pair of editions, a link 516 is provided to a different user interface where the actual differences in contents between the editions are shown to the user. The user can therefore make an informed purchase decision based on both quantitative and qualitative evaluations of the differences between the editions.

The diversity measures may aid the purchase decision in a number of ways. For example, a user who already owns an older version of a book may use the diversity measure to decide whether the latest version includes sufficiently diverse content to justify a purchase. Alternatively, if several versions of the same book are offered for sale, as is the case in the example shown in FIG. 5B, and the diversity measures indicate little change among the versions, the user may opt to purchase an older (and presumably cheaper) version of the book. For example, in FIG. 5B, the user may opt to purchase the third edition of the textbook since there is only a 2% difference in content between the third edition and the latest one and the third edition is $50 less than the latest edition. As discussed above, the user can also review a comparison of the content differences that is accessible via the link 516 to determine if the 2% difference covers important changes.

Determining Originality of Item Content Relative to Publicly Available Content

FIG. 6A is a flow diagram depicting an illustrative method 600 for determining the originality of items relative to publicly (or freely) available items. Although the example illustrated involves items in a product catalog, the applicability of the method 600 is not limited to product catalogs. Rather, the method is applicable to any collection of items. The method begins in block 602, where publicly available items such as websites are obtained and classified. In one embodiment, the classification is based on textual analysis of the content from the publicly available item. In block 604, the method 600 begins processing items in a product catalog and comparing them to publicly available items. For each item in the product catalog, the method 600 determines, through the operations in blocks 606, 608, 610, 612 and 614, how original the item is compared to the publicly available items. For example, in block 606, the method determines whether a current item in the publicly available item collection (I_(pub-i)) has the same topic as the current item in the catalog (I_(cat-i)) that is under evaluation. In one embodiment, the topic determination is made in accordance with the aforementioned methods, including, but not limited to, determining that the items are assigned to the same browse node, determining that the items have the same tags and determining that the items share the same textual classification.

If the two items are of the same topic, then the method 600 generates, in block 608, an originality score for these two items in accordance with the methods of generating originality scores described above. For example, I_(cat-i) may be a recipe book on Italian cooking, and I_(pub-i) may be a website devoted to Italian cuisine. In block 610, the generated originality score is stored in a diversity matrix such as that shown in FIG. 3B. The method 600 then determines, in decision block 612, whether there are additional items in the publicly available item collection, and if so, returns to decision block 606 to process those items with respect to the current product catalog item I_(cat-i). If not, the method 600 proceeds to block 614 to calculate an average of the originality scores generated (and optionally normalized as described above in conjunction with FIG. 3A) for the current product catalog item I_(cat-i). The average of the originality scores generated provides a quick way to determine how original the content of an item is compared to publicly available content. In other embodiments, variations other than an average, such as a mean of the originality scores generated, are used. The example table below lists the originality scores of the example Italian recipe book compared to five example websites deemed to be of the same topic, along with the average originality score:

Recipe Book           Originality Score
compared to:          (Normalized on a scale of 0 to 1)
cooking website 1     0.4
cooking website 2     0.7
cooking website 3     0.9
cooking website 4     0.1
cooking website 5     0.8
Average:              0.58

Next, in decision block 616, the method 600 determines if there are additional items in the product catalog that need to be processed. If so, the method returns to block 604 to process the next item in the product catalog. Otherwise, the method 600 completes in block 618. In one embodiment, at the completion of the method 600, each item in the product catalog will have an average originality score. In one embodiment, the method 600 provides a diversity measure for an item, which is calculated as follows: diversity measure = (1 − average originality score of the item) * 100%. Thus, if the average originality score is 0.58, the diversity measure is 42%.
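The per-item portion of the method 600 (blocks 606 through 614) and the resulting diversity measure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `same_topic` and `score_fn` are hypothetical callables standing in for the topic determination and the originality scoring described above, and the function names are not taken from the disclosure. The scores are those listed in the example table.

```python
from statistics import fmean
from typing import Callable, Dict, List, Optional


def average_originality(
    catalog_item: str,
    public_items: List[str],
    same_topic: Callable[[str, str], bool],
    score_fn: Callable[[str, str], float],
) -> Optional[float]:
    """Blocks 606-614 for one catalog item.

    Returns None when no publicly available item shares the item's topic.
    """
    scores = [
        score_fn(catalog_item, public_item)
        for public_item in public_items
        if same_topic(catalog_item, public_item)
    ]
    return fmean(scores) if scores else None


def diversity_measure(average_score: float) -> float:
    """Apply the stated formula: (1 - average originality score) * 100%."""
    return (1.0 - average_score) * 100.0


# Scores from the example table (recipe book vs. five cooking websites).
table_scores: Dict[str, float] = {
    "cooking website 1": 0.4,
    "cooking website 2": 0.7,
    "cooking website 3": 0.9,
    "cooking website 4": 0.1,
    "cooking website 5": 0.8,
}

avg = average_originality(
    "recipe book",
    list(table_scores),
    same_topic=lambda cat, pub: True,  # all five sites were deemed same-topic
    score_fn=lambda cat, pub: table_scores[pub],
)

print(round(avg, 2), round(diversity_measure(avg), 1))  # prints 0.58 42.0
```

Running the same function for each item in the catalog (the loop of blocks 604 and 616) would yield one average score, and hence one diversity measure, per item.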

The method 600 may thus generate diversity measures for various items in the product catalog. FIG. 6B shows an illustrative user interface 620 illustrating how the diversity measures aid a user in selecting an item for purchase. Shown in the user interface 620 are several cookbooks available for purchase. A diversity measure for each cookbook is provided to the user. For example, a cookbook 622, titled Lydia's Kitchen, is displayed with the message “Lydia's Kitchen is similar to 11% of the publicly available items of the same topic, on average.” As discussed above, publicly available items may include publicly/freely accessible websites, for example. In one embodiment, the user interface 620 further provides a user interface control element 626 through which the user can specify a maximum diversity measure criterion for filtering the items to be recommended to the user. In another embodiment, items with diversity measures meeting a certain threshold specified by default are provided to the user as recommendations.
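The filtering enabled by the control element 626 might be sketched as follows. This is only an illustration: the function name, the inclusive comparison, and the two placeholder cookbooks are assumptions; only Lydia's Kitchen and its 11% figure come from the example above.

```python
from typing import Dict, List


def filter_by_max_diversity(
    measures: Dict[str, float], max_measure: float
) -> List[str]:
    """Keep items whose diversity measure does not exceed the given maximum.

    Treating the maximum as inclusive is an assumption of this sketch.
    """
    return [title for title, measure in measures.items() if measure <= max_measure]


# Diversity measures in percent; the two placeholder entries are hypothetical.
catalog_measures = {
    "Lydia's Kitchen": 11.0,
    "Hypothetical Cookbook A": 42.0,
    "Hypothetical Cookbook B": 65.0,
}

recommended = filter_by_max_diversity(catalog_measures, max_measure=45.0)
```

A default threshold, as in the other embodiment mentioned above, could be supplied by giving `max_measure` a default value.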

CONCLUSION

All of the methods and processes described above may be embodied in, and fully automated via, software code modules executed by one or more general purpose computers or processors. The code modules may be stored in any type of computer-readable medium or other computer storage device. Some or all of the methods may alternatively be embodied in specialized computer hardware. In addition, the components referred to herein may be implemented in hardware, software, firmware, or a combination thereof.

Conditional language, such as, among others, “can,” “could,” “might,” or “may,” unless specifically stated otherwise, or otherwise understood within the context as used, is generally intended to convey that certain embodiments include, while other embodiments do not include, certain features, elements and/or steps. Thus, such conditional language is not generally intended to imply that features, elements and/or steps are in any way required for one or more embodiments or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.

Any process descriptions, elements or blocks in the flow diagrams described herein and/or depicted in the attached figures should be understood as potentially representing modules, segments or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process. Alternate implementations are included within the scope of the embodiments described herein in which elements or functions may be deleted, executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.

One skilled in the relevant art will appreciate that the methods and systems described above may be implemented by one or more computing devices, such as a memory for storing computer executable components for implementing the methods shown, for example, in FIGS. 3A, 4A, 4D, 5A and 6A, as well as a processor unit for executing such components. It will further be appreciated that the data and/or components described above may be stored on a computer readable medium and loaded into memory of a computer device using a drive mechanism associated with a computer readable medium storing the computer executable components (such as a CD-ROM or DVD-ROM). Further, the components and/or data can be included in a single device or distributed in any manner.

It should be emphasized that many variations and modifications may be made to the above-described embodiments, the elements of which are to be understood as being among other acceptable examples. For example, while the various examples provided above relate to electronic commerce, embodiments are applicable to other contexts including aiding library users in selecting books. All such modifications and variations are intended to be included herein within the scope of this disclosure and protected by the following claims.

What is claimed is:
 1. A computer-implemented method comprising: as implemented by one or more computing devices configured with specific executable instructions, for each of one or more other versions of an item that are different than a first version of the item, generating an originality score for a pairing including the other version of the item and the first version of the item, the originality score indicating a degree to which content of the other version of the item is diverse from content of the first version of the item, wherein the originality score is generated based at least in part on a comparison of the content of the other version of the item with the content of the first version of the item; and determining a diversity measure for the first version of the item that indicates a degree to which the first version of the item differs from the one or more other versions of the item, wherein the diversity measure is determined based at least in part on the one or more generated originality scores, wherein the diversity measure indicates a percentage or amount of content of the first version of the item that is different than content of the one or more other versions of the item.
 2. The computer-implemented method of claim 1, wherein the item is a book.
 3. The computer-implemented method of claim 2, wherein the one or more other versions of the item comprise one or more different editions of the book.
 4. The computer-implemented method of claim 1, wherein the comparison of the content of the other version of the item with the content of the first version of the item is based on at least one of min-wise independent permutations, term-document frequency, and latent semantic analysis.
 5. The computer-implemented method of claim 1, wherein the diversity measure for the first version of the item measures the degree to which the first version of the item differs from two or more other versions of the item.
 6. A system comprising: a data store configured to store content of items; and a computing device, comprising one or more processors, in communication with the data store that is configured to: retrieve, from the data store, content of a first version of an item; retrieve, from the data store, content of one or more other versions of the item that are different than the first version of the item; for each of the one or more other versions of the item, generate an originality score for a pairing including the other version of the item and the first version of the item, the originality score indicating a degree to which content of the other version of the item is diverse from content of the first version of the item, wherein the originality score is generated based at least in part on a comparison of the content of the other version of the item with the content of the first version of the item; and determine a diversity measure for the first version of the item that indicates a degree to which the first version of the item differs from the one or more other versions of the item, wherein the diversity measure is determined based at least in part on the one or more generated originality scores, wherein the diversity measure indicates a percentage or amount of content of the first version of the item that is different than content of the one or more other versions of the item.
 7. The system of claim 6, wherein the item is an electronic book.
 8. The system of claim 6, wherein the comparison of the content of the other version of the item with the content of the first version of the item is based on at least one of min-wise independent permutations, term-document frequency, and latent semantic analysis.
 9. A computer-readable, non-transitory storage medium storing computer-executable instructions that, when executed by a computer system, configure the computer system to perform operations comprising: retrieving, from an electronic data store, content of a first version of an item and content of two or more other versions of the item; for each of the two or more other versions of the item, generating an originality score for a pairing including the other version of the item and the first version of the item, the originality score indicating a degree to which content of the other version of the item is diverse from content of the first version of the item, wherein the originality score is generated based at least in part on a comparison of the content of the other version of the item with the content of the first version of the item; and determining a diversity measure for the first version of the item that indicates a degree to which the first version of the item differs from the two or more other versions of the item, wherein the diversity measure is determined based at least in part on the two or more generated originality scores.
 10. The computer-readable, non-transitory storage medium of claim 9, wherein the diversity measure indicates a percentage of content of the first version of the item that is different than content of the two or more other versions of the item.
 11. The computer-readable, non-transitory storage medium of claim 9, wherein the first version of the item comprises a particular edition of a book, wherein the two or more other versions of the item include another edition of the book.
 12. The computer-readable, non-transitory storage medium of claim 9, wherein the comparison of the content of the other version of the item with the content of the first version of the item is based on at least one of min-wise independent permutations and term-document frequency.